SUMMARY


This paper establishes bounds on the probability of system failure for
fault-tolerant systems of the type used, for example, in aviation control. Event series
leading to system failure are assumed to follow a semi-Markov model in which the 
potential sojourn times associated with component failures have exponential distri- 
butions and those associated with system responses have distributions with unspeci- 
fied form. A product form of the bounds is derived by using a model that provides 
for multiple competing system responses to component failures. The general form of 
the bounds is expressed in terms of integral factors that depend on component fail- 
ure rates and the distributions of system response times. The bounds are also 
expressed in terms of percentiles, conditional mean response times, and certain 
transition probabilities. The accuracy of the bounds is discussed both analytically 
and in terms of an example system. 


1. INTRODUCTION

In recent years, the desire to improve performance, reliability, and safety of 
commercial and military aircraft systems has led to the increased use of fault- 
tolerant hardware and software control systems. Some control systems now employ 
tests to detect and identify failed components and, if failures are identified, the 
system may reconfigure to exclude information provided by failed components. The 
techniques for doing this (Montgomery, 1975; Smith et al., 1977; and Willsky, 1976)
are often based on hardware duplication of like components with comparison monitor- 
ing for failure detection, voting or averaging to mask errors from failed compo- 
nents, and regrouping and repeated comparison for failure identification; also, 
reconfiguration may take the form of switchovers to spare components or switchovers 
to operational components. 

Early recognition that combinatorial assessment methods would not readily 
account for the effect on system reliability of such state-dependent system 
responses has led to the use of multistate point process models. Several automated 
semi-Markov and nonhomogeneous Markov models and the associated solution techniques,
which have been proposed over the last several years, are surveyed and discussed by 
Geist and Trivedi, 1983, in terms of design limitations imposed by model assump- 
tions, efficiency and accuracy of the solution techniques employed, and the useful- 
ness of the types of solutions obtained. 

Although some authors suggest direct computational methods (Ng and Avizienis, 
1976), or approximate methods based on state aggregation techniques in detailed 
fault-handling models (Stiffler et al., 1979), White, 1984, suggests upper and lower 
bounds for system unreliability. The framework from which he derives the bounds is 
a semi-Markov model in which component failure times have exponential distributions 
and system response times have distributions with unspecified form. His bounds take 
a product form, with one set of factors depending on information concerning com- 
ponent failure rates and another set depending, in addition, on the means and vari- 
ances of the response times. 

In this paper the bounds given by White, 1984, are generalized to a model that 
provides for competing system responses to component failures. The model describes
a system's history, which consists of a series of states entered over a period of
time together with the time intervals between state changes. Entrance into a state 
may correspond to the occurrence of a component failure or to the occurrence of a 
system response to previously failed components. Competing responses are often of 
types such as detecting and deactivating previously failed components or activating 
spare components. The assumed model is semi-Markov, in which the potential 
(possible) sojourn times associated with component failures have exponential distri- 
butions and those associated with system responses have distributions with unspeci- 
fied form. 

Information concerning system response times may be available from experimental 
fault injection studies (Lala and Smith, 1983) or from analytical derivations of the 
response time distributions, as determined from specifications of sequential statis- 
tical tests that are often employed in system design (Walker, 1980). 

In section 3 a product form of the bounds is derived which has integral factors 
that depend on component failure rates and the distributions of system response 
times. This form permits substituting sample cumulative distributions and elimi- 
nates the need for certain intermediate stages of data analysis, such as checking 
the adequacy of assumed parametric forms. The simpler form given in section 4 may 
be useful when minimal information is available in the form of conditional mean 
response times, percentiles, and certain transition probabilities. To reduce fur- 
ther the information needed, certain nonparametric classes of response time distri- 
butions may be employed, as discussed in section 5. 


2. THE MODEL 

In the context of a particular application, a system state is a vector having 
elements that specify the number of operational components, the status of system 
response, and the current system configuration. It is convenient to label the 
system states simply as {1, 2, ..., k}. A certain subset $R$ corresponds to states
entered as a result of system responses to component failures, and the remaining
set $\bar{R}$ corresponds to states entered when components fail. As illustrated in
section 6, it is possible for some element of $R$ to correspond to an absorbing
state (system failure).

A system's history consists of a series of states $z_0, z_1, \ldots, z_n$ entered
over a period of time together with the sojourn times (time intervals between state
changes) $u_1, u_2, \ldots, u_n$. Typically, the initial state $z_0$ is a fully operational
system state and $z_1$ is entered when some component fails. The system may enter
$z_2$ as a result of another component failure or as a result of system response to
the first failure, and so on, giving a series of states in $R$ and $\bar{R}$. Successive
responses may result from failure detection and subsequent deactivation of a failed
component. Competing responses arise, for example, when two components, say A and B,
have failed and the potential responses are of types such as deactivate A versus
deactivate B or activate the spare A versus deactivate the failed active unit B.

If the process is semi-Markov, then the random variables $Z_0, Z_1, \ldots, Z_n$
follow a Markov chain and the sojourn times $U_1, U_2, \ldots, U_n$ are conditionally
independent, given a particular series of state changes. The usual model specification
(Lagakos et al., 1978) is given by the initial and transition probabilities and
by the conditional distributions of the sojourn times as follows:

$$\theta(i) = P(Z_0 = i), \qquad \theta(i,j) = P(Z_{m+1} = j \mid Z_m = i)$$

$$Q(x;i,j) = P(U_{m+1} \le x \mid Z_m = i,\ Z_{m+1} = j)$$


Suppose that the system enters state $z_{m-1} = i$ at the $(m-1)$st epoch. Let
$T(i,\ell)$ $(\ell = 1, 2, \ldots, k)$ denote the potential (possible) sojourn times associated
with the states $\ell = 1, 2, \ldots, k$. Then, the system enters state $j$ only if
$T(i,j)$ is the smallest of the potential sojourn times. The time between the $(m-1)$st
and $m$th epochs is $U_m = \min\{T(i,1), T(i,2), \ldots, T(i,k)\}$. In particular,
$\theta(i,j)\, Q(x;i,j)$ gives the probability that $U_m \le x$ and $T(i,j) \le \min\{T(i,\ell)\}$, where
$\ell \ne j$, given that the current system state is $z_{m-1} = i$.
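
The competing-times mechanism described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the state labels, the failure rate, and the uniform response-time distribution are hypothetical and are not taken from the paper.

```python
import random


def next_transition(samplers, rng):
    """Draw one potential sojourn time per candidate state.

    Returns (next_state, sojourn_time): the state whose potential
    sojourn time is smallest is the one the system enters.
    """
    draws = {state: sample(rng) for state, sample in samplers.items()}
    next_state = min(draws, key=draws.get)
    return next_state, draws[next_state]


# Hypothetical example: from the current state, a component failure
# (leading to state 2) competes with a system response (leading to state 6).
FAILURE_RATE = 0.001  # assumed exponential failure rate, per hour
samplers = {
    2: lambda rng: rng.expovariate(FAILURE_RATE),  # failure time: exponential
    6: lambda rng: rng.uniform(0.0, 0.003),        # response time: arbitrary form
}

rng = random.Random(0)
counts = {2: 0, 6: 0}
for _ in range(10_000):
    state, _sojourn = next_transition(samplers, rng)
    counts[state] += 1
```

With these assumed values the response almost always wins the race, which is the regime in which the bounds of section 3 are expected to be tight.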

If the potential sojourn times are independent and have continuous distributions,
then

$$\theta(i,j)\, dQ(x;i,j) = \prod_{\ell \ne j} \bar{G}(x;i,\ell)\, dG(x;i,j) \tag{1}$$

where $\bar{G}(x;i,\ell) = P\{T(i,\ell) > x\}$ represents the survivor functions. As mentioned
earlier, the potential sojourn times $T(i,\ell)$, $\ell \in \bar{R}$, associated with component failures
have exponential distributions in which the parameters $\lambda(i,\ell)$ depend on the
adjoining states $i$ and $\ell$, and the response times $T(i,\ell)$, $\ell \in R$, have distributions
$G(x;i,\ell)$, $\ell \in R$, with unspecified form.

One question that arises is whether any generality is added by permitting
dependent response times. Results given by Miller, 1977, and Tsiatis, 1975, show
that if $Q(x;i,j)$ is continuous, then independent random variables $\{T(i,\ell)\}$ exist
having distributions that satisfy equation (1). In particular, when $Q(x;i,j)$ is
continuous, the distributions of response times that satisfy equation (1) have
survivor functions given by Tsiatis, 1975, as follows:

$$\bar{G}(x;i,j) = \exp\left\{ -\int_0^x h(y;i,j)\, dy \right\}$$

where

$$h(y;i,j) = \theta(i,j)\, dQ(y;i,j) \Big/ \sum_{\ell \in R} \theta(i,\ell)\{1 - Q(y;i,\ell)\}$$

Thus, it suffices to consider only the representation given by equation (1).


3. GENERAL FORM OF THE BOUNDS 

As the model now stands, we have not included an expected large difference in 
the component failure and system response times. Fault-tolerant systems often 
employ highly reliable components and are designed for quick response to component 
failures; hence, the response times may often be stochastically much smaller than 
failure times. This assumption is the basis for the computational techniques
proposed by Stiffler et al., 1979, and it is also the basis for the accuracy, but not
for the validity, of the bounds given in the following discussion.

Our derivation, similar to that given by White, 1984, consists, in part, of
partitioning a particular event series $z_0, z_1, \ldots, z_n$ according to the character
of the potential sojourn times attached to $z_0, z_1, \ldots, z_{n-1}$. If $i \ne j$ and if
all potential sojourn times attached to $z_{i-1}$ correspond to component failures,
while those attached to $z_{j-1}$ include one or more system response times, then $U_j$
is stochastically smaller than $U_i$, provided, of course, that the potential
response times are stochastically smaller than the potential failure times. The
upper bound is obtained by excluding the stochastically smaller random variables
from the hitting time, $T = U_1 + U_2 + \cdots + U_n$, of $z_n$; thus, $T$ is approximated by
a sum of fewer variables. The lower bound is obtained in a similar way except that
certain constants, chosen to represent upper percentiles of the response time
distributions, replace the previously excluded variables; this gives a new random
variable that, effectively, is stochastically larger than $T$.

Let $z_0, z_1, \ldots, z_n$ represent a particular path leading to an absorbing
state $z_n$. Let $A$, $B$, and $C$ partition the indices of $z_0, z_1, \ldots, z_n$ in the
following way: $i \in A$ if all potential sojourn times leading from $z_{i-1}$ represent
elapsed times to component failures; $i \in B$ if the particular potential sojourn
time leading from $z_{i-1}$ to $z_i$ represents a response time; and $i \in C$ provided
that $i \notin A$ and the particular potential sojourn time leading from $z_{i-1}$ to $z_i$
represents an elapsed time to some component failure.

The probability $p(t)$ of hitting $z_n$ by time $t$ and entering the series of
states $z_0, z_1, \ldots, z_n$ is

$$p(t) = P(T \le t,\ Z_0 = z_0,\ \ldots,\ Z_n = z_n)$$

where $T = U_1 + \cdots + U_n$ is the hitting time of $z_n$. This probability is given by

$$p(t) = \int_S \theta(z_0) \prod_{i=1}^{n} \theta(z_{i-1}, z_i)\, dQ(u_i; z_{i-1}, z_i) \tag{2}$$

where

$$S = \{(u_1, u_2, \ldots, u_n) : u_1 + u_2 + \cdots + u_n \le t\}$$

Let $\Delta = \sum \Delta_i$ denote the sum of an arbitrarily chosen set of nonnegative
constants $\Delta_i$, $i \in B \cup C$. Upper and lower bounds $P_U(t)$ and $P_L(t)$, respectively, for
$p(t)$ follow by observing, as in White, 1984, that the sets

$$S_U = \Big\{(u_1, \ldots, u_n) : \sum_A u_i \le t\Big\}$$

and

$$S_L = \Big\{(u_1, \ldots, u_n) : \sum_A u_i \le t - \Delta,\ u_i \le \Delta_i \text{ if } i \in B \cup C\Big\}$$

satisfy $S_L \subseteq S \subseteq S_U$.
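
The nesting of the three integration regions can be checked mechanically. The sketch below is an illustration only: it represents a sojourn-time vector as a dictionary keyed by the path index i, and the particular values of t, the index set A, and the constants are hypothetical.

```python
def in_S(u, t):
    """Exact region: all sojourn times together fit within t."""
    return sum(u.values()) <= t


def in_S_upper(u, t, A):
    """Enlarged region: only the sojourn times indexed by A are constrained."""
    return sum(u[i] for i in A) <= t


def in_S_lower(u, t, A, deltas):
    """Reduced region: times in A fit within t minus the total of the
    constants, and every other time is capped by its own constant."""
    total = sum(deltas.values())
    return (sum(u[i] for i in A) <= t - total
            and all(u[i] <= d for i, d in deltas.items()))


# Hypothetical three-step path with A = {1} and B union C = {2, 3}.
t = 1.0
A = [1]
deltas = {2: 0.003, 3: 0.003}
u = {1: 0.5, 2: 0.001, 3: 0.002}
membership = (in_S_lower(u, t, A, deltas), in_S(u, t), in_S_upper(u, t, A))
```

Any nonnegative vector that lies in the reduced region also lies in the exact region, and any vector in the exact region lies in the enlarged one, which is what makes the two integrals bracket p(t).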


Now, replacing $S$ by $S_U$ and $S_L$ in equation (2) gives, respectively,

$$P_U(t) = \theta(z_0)\, H(t) \prod_A a_i \prod_B b_i \prod_C c_i \tag{3}$$

and

$$P_L(t) = \theta(z_0)\, H(t - \Delta) \prod_A a_i \prod_B b_i' \prod_C c_i' \tag{4}$$

where, in terms of $\lambda_i = \sum_{\ell} \lambda(z_{i-1}, \ell)$,

$$a_i = \lambda(z_{i-1}, z_i) / \lambda_i \tag{5}$$

$$b_i = \int_0^{\infty} e^{-\lambda_i y} \prod_{\ell \ne z_i} \bar{G}(y; z_{i-1}, \ell)\, dG(y; z_{i-1}, z_i) \tag{6}$$

$$c_i = \int_0^{\infty} e^{-\lambda_i y}\, \lambda(z_{i-1}, z_i) \prod_{\ell} \bar{G}(y; z_{i-1}, \ell)\, dy \tag{7}$$

$$b_i' = \int_0^{\Delta_i} e^{-\lambda_i y} \prod_{\ell \ne z_i} \bar{G}(y; z_{i-1}, \ell)\, dG(y; z_{i-1}, z_i) \tag{8}$$

$$c_i' = \int_0^{\Delta_i} e^{-\lambda_i y}\, \lambda(z_{i-1}, z_i) \prod_{\ell} \bar{G}(y; z_{i-1}, \ell)\, dy \tag{9}$$

The function

$$H(x) = P\Big(\sum_{i \in A} U_i \le x\Big) \tag{10}$$

appearing in equations (3) and (4) represents the distribution function for a sum of
independent random variables having exponential distributions with rate parameters
$\lambda_i$, $i \in A$. The indices for each product shown in equations (6) to (9) vary only
over the indices of the response time distributions.
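
As an illustration of how H(x) might be evaluated, the following sketch uses the standard closed form for the distribution of a sum of independent exponential variables with pairwise distinct rates. The function name is hypothetical, and the distinct-rates restriction is an assumption of this sketch rather than of the paper. The alternating sum can lose precision when several rates are very small, so for such cases a more careful method would be preferable.

```python
import math


def hypoexponential_cdf(x, rates):
    """P(sum of independent Exp(rate) variables <= x), distinct rates only."""
    if len(set(rates)) != len(rates):
        raise ValueError("this closed form requires pairwise distinct rates")
    if x <= 0:
        return 0.0
    survivor = 0.0
    for i, r_i in enumerate(rates):
        # Partial-fraction weight attached to exp(-r_i * x).
        weight = 1.0
        for j, r_j in enumerate(rates):
            if j != i:
                weight *= r_j / (r_j - r_i)
        survivor += weight * math.exp(-r_i * x)
    return 1.0 - survivor


# Single-rate case: reduces to the ordinary exponential distribution.
h_single = hypoexponential_cdf(1.0, [0.004])
```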

The quantities $b_i$, $c_i$, $b_i'$, and $c_i'$ are directly estimable whenever the
response times are observed experimentally. The choice of estimates would vary
depending on whether the response times are observed individually or observed as
competing events. In the former case, substitution of censored data forms of the
sample cumulative distributions would give nonparametric estimates. In the latter
case, nonparametric estimates as described by Kalbfleisch and Prentice, 1980, would
be applicable. The $\Delta_i$ would probably be chosen as points of censoring, hopefully
at the extreme upper tails of the response time distributions.
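
A minimal sketch of the plug-in idea follows, restricted to the simplest situation: a single response time competing with component failures and a fully observed (uncensored) sample. Both restrictions are assumptions of this illustration, as are the function and parameter names. Under them the product in equation (6) is empty, so the empirical distribution can be substituted directly into equations (6) and (7).

```python
import math


def plug_in_factors(response_times, rate_to_next, total_rate):
    """Empirical b_i and c_i for one competing response time.

    response_times : observed, uncensored response times
    rate_to_next   : failure rate lambda(z_{i-1}, z_i)
    total_rate     : lambda_i, the sum of failure rates out of z_{i-1}
    """
    n = len(response_times)
    # b_i: average of exp(-lambda_i * y) over the sample.
    b = sum(math.exp(-total_rate * y) for y in response_times) / n
    # c_i: the integral of exp(-lambda_i * y) against the empirical survivor
    # function, which equals the average of (1 - exp(-lambda_i * y)) / lambda_i.
    c = rate_to_next * sum(-math.expm1(-total_rate * y)
                           for y in response_times) / (n * total_rate)
    return b, c


# Hypothetical sample of response times (hours) and failure rates (per hour).
b_hat, c_hat = plug_in_factors([0.001, 0.002, 0.003], 0.001, 0.003)
```

In this single-response case the two estimates are tied together by b + c * total_rate / rate_to_next = 1, which gives a quick consistency check on any implementation.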


4. BOUNDS EXPRESSED IN TERMS OF TRANSITION PROBABILITIES,
CONDITIONAL MEANS, AND PERCENTILES

One application of the bounds given earlier assumes that component failure 
rates are known quantities and that certain minimal information is available
concerning the distributions of response times. In White, 1984, the model is limited
to the case in which a single response time is competing with component failures and 
the bounds are given in terms of means and variances. For the general case, it is 
unlikely that accurate bounds can be expressed solely in terms of means and vari- 
ances. The upper bounds given in this section require only information concerning 
certain transition probabilities and percentiles. The lower bounds require addi- 
tional information concerning the conditional mean minimum response times. 

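Consider first the upper bound given by equation (3). Since typically the rate
parameters $\lambda(i,j)$ take quite small values, a fairly accurate upper bound is given
by replacing $b_i$ and $c_i$ appearing in equation (3) by

$$b_{1i} = \int_0^{\infty} \prod_{\ell \ne z_i} \bar{G}(y; z_{i-1}, \ell)\, dG(y; z_{i-1}, z_i) \tag{11}$$

$$c_{1i} = \lambda(z_{i-1}, z_i)\, G_i(\Delta_i)\, \mu_i(\Delta_i) + \Delta_i\, \lambda(z_{i-1}, z_i)\, \bar{G}_i(\Delta_i) + \lambda(z_{i-1}, z_i)\, \bar{G}_i(\Delta_i)\, \lambda_i^{-1} \tag{12}$$

where

$$\bar{G}_i(x) = \prod_{\ell} \bar{G}(x; z_{i-1}, \ell) \tag{13}$$

$$G_i(x) = 1 - \bar{G}_i(x) \tag{14}$$

$$\mu_i(\Delta_i) = [G_i(\Delta_i)]^{-1} \int_0^{\Delta_i} x\, dG_i(x) \tag{15}$$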
Note that each $b_{1i}$ represents a transition probability and is computed as if all
component failure modes were eliminated at state $z_{i-1}$. The $b_{1i}$ takes a value
equal to 1 whenever a single response time is competing with failure times. The
quantity $\mu_i(\Delta_i)$ represents the conditional mean minimum response time given that
the smallest response occurs in $(0, \Delta_i)$. Since $G_i(\Delta_i)\, \mu_i(\Delta_i) \le G_i(\Delta_i)\, \Delta_i$, the
upper bound can be computed without knowledge of $\mu_i(\Delta_i)$. In this case the information
needed to compute the upper bound consists of the transition probabilities and
the probability that the smallest response time exceeds $\Delta_i$.

Next, a new lower bound is given by replacing each of $b_i'$ and $c_i'$ appearing
in equation (4) by the quantities

$$b_{1i}' = \exp(-\lambda_i \Delta_i)\, \{b_{1i} - \bar{G}_i(\Delta_i)\} \tag{16}$$

and

$$c_{1i}' = \lambda(z_{i-1}, z_i)\, \exp(-\lambda_i \Delta_i)\, \{G_i(\Delta_i)\, \mu_i(\Delta_i) + \Delta_i\, \bar{G}_i(\Delta_i)\} \tag{17}$$


An optimal choice of the $\Delta_i$ to minimize $P_U(t) - P_L(t)$ would probably
require some knowledge of the form of the distributions of response times. The simple
results, $b_{1i} - b_{1i}' \le \lambda_i \Delta_i + \bar{G}_i(\Delta_i)$ and $c_{1i} - c_{1i}' \le \lambda_i \Delta_i + \bar{G}_i(\Delta_i)$, show that
$b_{1i} - b_{1i}'$ and $c_{1i} - c_{1i}'$ each converge to 0 as $\lambda_i$ and $\bar{G}_i(\Delta_i)$ decrease to 0.
Also, with $r$ representing the number of terms in $A$, it is not difficult to
show that $H(t) - H(t - \Delta)$ is dominated by $\{t^r - (t - \Delta)^r\} \prod_A \lambda_i$.
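
The simplified factors of equations (12), (16), and (17) need only a handful of scalar inputs, as the following sketch shows. The function names, argument names, and numerical values are hypothetical and serve only to illustrate the formulas and the two gap inequalities stated above.

```python
import math


def c_upper(rate_next, total_rate, delta, G_delta, mean_below):
    """Equation (12): upper-bound factor for a failure competing with responses."""
    surv = 1.0 - G_delta  # probability the minimum response time exceeds delta
    return (rate_next * (G_delta * mean_below + delta * surv)
            + rate_next * surv / total_rate)


def b_lower(b1, total_rate, delta, G_delta):
    """Equation (16): lower-bound factor for a response transition."""
    return math.exp(-total_rate * delta) * (b1 - (1.0 - G_delta))


def c_lower(rate_next, total_rate, delta, G_delta, mean_below):
    """Equation (17): lower-bound factor for a failure competing with responses."""
    surv = 1.0 - G_delta
    return (rate_next * math.exp(-total_rate * delta)
            * (G_delta * mean_below + delta * surv))


# Hypothetical inputs: rates per hour, times in hours.
rate_next, total_rate, delta = 1e-4, 3e-4, 0.01
G_delta, mean_below, b1 = 0.99, 0.004, 0.7

gap_b = b1 - b_lower(b1, total_rate, delta, G_delta)
gap_c = (c_upper(rate_next, total_rate, delta, G_delta, mean_below)
         - c_lower(rate_next, total_rate, delta, G_delta, mean_below))
gap_limit = total_rate * delta + (1.0 - G_delta)
```

For these inputs both gaps stay below the limit, and most of each gap comes from the tail probability rather than from the failure rates, which suggests where additional response-time data would help most.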

This analysis suggests that if the system components are highly reliable and if
the system is designed for quick response to component failures, then tight bounds
would often be given by choosing $\Delta_i$ equal to large percentiles of the distributions
of minimum response times. Exact methods for computing $H(x)$ are available
in several introductory-level texts that discuss time homogeneous Markov processes.


5. BOUNDS EXPRESSED IN TERMS OF PERCENTILES AND CONDITIONAL MEANS 

The bounds given in this section rely on weak assumptions relating the response 
time distributions. The aim is to reduce the information needed to compute the 
bounds. This is done by replacing $b_{1i}$ appearing in equations (11) and (16) by
other quantities, depending on the percentiles of the response time distributions. 

Let $F(\cdot)$ represent a continuous baseline distribution and let $C_1$ denote the
class of continuous distributions generated by $F(\cdot)$ in the following way: $G(\cdot)$
belongs to $C_1$ if $\bar{G}(x) = \{\bar{F}(x)\}^{\alpha}$ for some value of $\alpha > 0$ and all values of
$x \ge 0$. The class $C_1$ is often described as the class of distributions having
proportional hazard rates. The class is nonparametric in the sense that it involves
an unspecified form of a baseline distribution. It has been studied extensively
(Kalbfleisch and Prentice, 1980) as a framework for developing nonparametric
statistical methods.
In the present context, suppose that the response time distributions belong to
$C_1$ for some unspecified baseline distribution. Then, each survivor function has
the representation

$$\bar{G}(x; z_{i-1}, \ell) = \{\bar{G}_i(x)\}^{\phi(z_{i-1}, \ell)} \tag{18}$$

where

$$0 < \phi(z_{i-1}, \ell) < 1, \qquad \sum_{\ell} \phi(z_{i-1}, \ell) = 1$$

The survivor function for the minimum response time is given by

$$\bar{G}_i(x) = \prod_{\ell} \bar{G}(x; z_{i-1}, \ell) \tag{19}$$

Upon replacing $b_{1i}$ appearing in equations (11) and (16) by

$$b_{1i} = \phi(z_{i-1}, z_i) = \frac{\log \bar{G}(\Delta_i; z_{i-1}, z_i)}{\sum_{\ell} \log \bar{G}(\Delta_i; z_{i-1}, \ell)} \tag{20}$$

we get upper bounds that require only information concerning the probabilities that
the response times exceed $\Delta_i$. The lower bounds still depend on the conditional
mean minimum response time.
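
Equation (20) reduces to a ratio of logarithms, so it is easy to evaluate once the exceedance probabilities at the chosen percentile point are known. The sketch below is an illustration only; the function name and the state labels 6 and 8 are hypothetical.

```python
import math


def proportional_hazards_b1(survivor_at_delta, target):
    """Equation (20): transition probability under proportional hazards.

    survivor_at_delta : maps each competing response state to the probability
                        that its response time exceeds the chosen constant
    target            : the response state z_i actually entered
    """
    for p in survivor_at_delta.values():
        if not 0.0 < p < 1.0:
            raise ValueError("each exceedance probability must lie strictly in (0, 1)")
    logs = {state: math.log(p) for state, p in survivor_at_delta.items()}
    return logs[target] / sum(logs.values())


# Hypothetical check with two exponential response times of rates 1 and 3,
# evaluated at the point 1: the slower response wins with probability 1/4.
b1_example = proportional_hazards_b1({6: math.exp(-1.0), 8: math.exp(-3.0)}, 6)
```

Exponential response times are a special case of the proportional hazards class, which is why the familiar rate ratio is recovered in this check.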

Now consider a second class $C_2$, generated from a continuous baseline distribution
$F(\cdot)$ in the following way: $G(\cdot)$ belongs to $C_2$ if $G(x) = F^{\alpha}(x)$ for
some value of $\alpha > 0$ and all values of $x \ge 0$. This class is also nonparametric in
the same sense as before; however, it appears to have been studied less as a basis
for nonparametric inference.

If the response time distributions belong to $C_2$ for some unspecified
baseline distribution, then each has the representation

$$G(x; z_{i-1}, \ell) = \{H_i(x)\}^{\psi(z_{i-1}, \ell)} \tag{21}$$

where

$$0 < \psi(z_{i-1}, \ell) < 1, \qquad \sum_{\ell} \psi(z_{i-1}, \ell) = 1$$

The distribution function $H_i(x)$ for the largest of the response times is given by

$$H_i(x) = \prod_{\ell} G(x; z_{i-1}, \ell) \tag{22}$$

Substituting from equation (21) gives

$$b_{1i} = \int_0^1 \prod_{\ell \ne z_i} \{1 - u^{\psi(z_{i-1}, \ell)}\}\, \psi(z_{i-1}, z_i)\, u^{\psi(z_{i-1}, z_i) - 1}\, du \tag{23}$$

Upon expanding the product in the integrand, $b_{1i}$ can be written as a sum of terms
involving only the quantities $\psi(z_{i-1}, \ell)$. Also, from equation (21) each $\psi(z_{i-1}, \ell)$
can be represented in the form

$$\psi(z_{i-1}, \ell) = \frac{\log G(\Delta_i; z_{i-1}, \ell)}{\sum_{\ell'} \log G(\Delta_i; z_{i-1}, \ell')} \tag{24}$$

Therefore, $b_{1i}$ as given by equations (23) and (24) depends only on the percentiles
of the response time distributions and can be substituted in equations (11) and (16)
to give a new set of bounds.
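
As a hedged illustration of equations (23) and (24), the sketch below evaluates the integral numerically rather than by expanding the product. It first applies the substitution v = u raised to the power psi(z_{i-1}, z_i), which removes the singular factor at the origin; that substitution, the midpoint rule, and all names and state labels are choices of this sketch, not of the paper.

```python
import math


def psi_from_cdf_values(cdf_at_delta):
    """Equation (24): exponents from distribution-function values at one point."""
    logs = {state: math.log(p) for state, p in cdf_at_delta.items()}
    total = sum(logs.values())
    return {state: value / total for state, value in logs.items()}


def b1_class_c2(psi, target, steps=20_000):
    """Equation (23) after substituting v = u ** psi[target].

    The integral becomes the integral over (0, 1) of the product of
    (1 - v ** (psi[l] / psi[target])) over the other response states l,
    evaluated here with a midpoint rule.
    """
    ratios = [psi[s] / psi[target] for s in psi if s != target]
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * h
        term = 1.0
        for r in ratios:
            term *= 1.0 - v ** r
        total += term
    return total * h


# Hypothetical case: two response times with equal exponents, so by symmetry
# each is the smaller of the two with probability one half.
psi_equal = psi_from_cdf_values({6: 0.5, 8: 0.5})
b1_equal = b1_class_c2(psi_equal, 6)
```

For two competing responses the values obtained for the two targets should add up to 1, which offers a simple check on the numerical integration.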


6. AN EXAMPLE 

The example to be discussed is concerned with the effect on system reliability 
of a particular choice of interval for cycling a spare and serves to illustrate the 
application of the bounds given in section 4. 

Consider a system having three active processor units and one spare. Active 
units have a failure rate $\lambda$ and the spare has a failure rate $\mu$. The output of
the active units is subject to majority vote; thus, the system survives with one 
failed active unit. The spare and one predesignated active unit form a cooperating 
pair. To check its operational status, the spare is automatically activated and 
switched with the cooperating unit at regular intervals. The spare is also 
activated whenever a failed unit is detected and, in this case, it replaces the 
failed unit. 

The desire to check the operational status of the spare leaves open the possi- 
bility of cycling in a failed spare at some instant when a noncooperating unit has 
failed. As shown in figure 1, one of the noncooperating active units fails
(state 1), the spare fails (state 2), and the system, being unaware that either unit
has failed, automatically switches the spare with the good active unit (state 3).
State 3, as well as states 5 and 7, represents system failure since the system is 
not fault tolerant at any instant when two of the three active units have failed. 
States 6 and 8 designate operational states that are attained when the system 
detects, identifies, and retires the failed active unit and then replaces it with 
the spare. 

In terms of the previous notation, $z_0 = 0$, $z_1 = 1$, $z_2 = 2$, $z_3 = 3$, $A = \{1\}$,
$B = \{3\}$, and $C = \{2\}$. For the sake of simplicity, take $\Delta_i = \Delta$ and assume that
the response time distributions $G(x;1,6)$ and $G(x;2,8)$ are identical. The time
(measured from the instant of entering state 2) needed to switch the spare to active
status has a distribution limited to $(0, \Delta)$; that is, $\Delta$ is chosen equal to the
length of the cycling interval and $\bar{G}(\Delta;2,3) = 0$.

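The bounds have the form

$$P_U(t) = H(t)\, a_1\, b_{13}\, c_{12}$$

and

$$P_L(t) = H(t - 2\Delta)\, a_1\, b_{13}'\, c_{12}'$$

where $a_1 = 2\lambda(3\lambda + \mu)^{-1}$, the offset $2\Delta = \Delta_2 + \Delta_3$ is the sum of constants appearing in
equation (4), $H(x) = 1 - \exp\{-(3\lambda + \mu)x\}$, and $b_{13}$, $c_{12}$,
$b_{13}'$, and $c_{12}'$ are given by equations (11), (12), (16), and (17), respectively:

$$b_{13} = \int_0^{\Delta} \bar{G}(x;2,8)\, dG(x;2,3)$$

$$c_{12} = \mu\{G_2(\Delta)\, \mu_2(\Delta) + \Delta\, \bar{G}_2(\Delta)\} + \mu(2\lambda + \mu)^{-1}\, \bar{G}_2(\Delta)$$

$$b_{13}' = \{b_{13} - \bar{G}_3(\Delta)\} \exp(-2\lambda\Delta)$$

$$c_{12}' = \mu\{G_2(\Delta)\, \mu_2(\Delta) + \Delta\, \bar{G}_2(\Delta)\} \exp[-(2\lambda + \mu)\Delta]$$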

To compare the upper and lower bounds for the probability of hitting state 3
prior to completing the mission in 1 hr ($t = 1$), suppose that $\Delta = 0.003$,
$\mu = \lambda = 0.001$, and experimental results give $\bar{G}(\Delta;1,6) = 0.04$, $\mu_2(\Delta) = 0.002$,
and $b_{13} = 0.68$. Then, $\bar{G}_2(\Delta) = 0.04$, $\bar{G}_3(\Delta) = 0$, and upper and lower bounds are
$P_U(t) = 1.81 \times 10^{-5}$ and $P_L(t) = 2.74 \times 10^{-9}$, respectively.
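
The arithmetic of this example can be retraced with a short script. It rests on a reading of the garbled source that should be treated as an assumption: the cycling interval is taken as 0.003 hr, both failure rates as 0.001 per hour, and the offset in the lower bound as twice the cycling interval. The variable names are hypothetical. Under this reading the script reproduces the stated upper bound to three significant figures and gives a lower bound within about half a percent of the stated value.

```python
import math

lam = 0.001    # assumed failure rate of an active unit, per hour
mu = 0.001     # assumed failure rate of the spare, per hour
delta = 0.003  # assumed cycling interval, hours
t = 1.0        # mission time, hours

surv2 = 0.04   # probability that the response time from state 1 exceeds delta
surv3 = 0.0    # the switch always occurs within the cycling interval
mean2 = 0.002  # conditional mean response time, given response before delta
b13 = 0.68     # transition probability from state 2 to state 3


def H(x):
    """Distribution of the time to the first failure out of state 0."""
    return -math.expm1(-(3 * lam + mu) * x)


a1 = 2 * lam / (3 * lam + mu)
core = (1.0 - surv2) * mean2 + delta * surv2

c12 = mu * core + mu * surv2 / (2 * lam + mu)          # equation (12)
b13_low = (b13 - surv3) * math.exp(-2 * lam * delta)   # equation (16)
c12_low = mu * core * math.exp(-(2 * lam + mu) * delta)  # equation (17)

p_upper = H(t) * a1 * b13 * c12
p_lower = H(t - 2 * delta) * a1 * b13_low * c12_low
```

Almost all of c12 comes from its last term, mu * surv2 / (2 * lam + mu), which is absent from c12_low; this is the numerical source of the wide gap between the two bounds.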

The difference between the upper and lower bounds is largely due to the
difference in $c_{12}$ and $c_{12}'$, but this in turn can be attributed to a lack of
information concerning the shapes of the distributions above the limit $\Delta$.

If it is assumed that $G(x;2,3)$ is a uniform distribution over $(0, \Delta)$, then
less information is needed to compute the bounds; in this case,

$$b_{13} = \Delta^{-1} \int_0^{\Delta} \bar{G}(x;2,8)\, dx = \Delta^{-1}\{G_2(\Delta)\, \mu_2(\Delta) + \Delta\, \bar{G}_2(\Delta)\}$$

and substitution gives $b_{13} = 0.68$.


CONCLUDING REMARKS

In this paper a semi-Markov model has been analyzed to give upper and lower
bounds for system unreliability. The model provides for multiple competing system
responses to component failures and is flexible in terms of describing the
distributions of system response times. We have shown that the accuracy of the bounds
improves as the component failure rates and the survivor functions
of the minimum response times decrease to 0. Thus, generally, if the response time
distributions are concentrated over a narrow range, accurate bounds would be given
by selecting percentiles at the upper end of this range. The best choice of
parameters for representing the bounds depends on the available information; in the
experimental context, percentiles and conditional means appear preferable to other
parameters because of the ease of substituting censoring points for percentiles and
the ease of directly estimating the conditional mean response times.



NASA Langley Research Center 
Hampton, VA 23665 